---
title: "Assignment 5: Space-Time Prediction of Indego Bike Share Demand"
author: "Jed Chew and Muhammad Al Abbas"
date: 12/1/2025
format:
html:
code-fold: show
code-tools: true
toc: true
toc-depth: 3
toc-location: left
theme: cosmo
embed-resources: true
editor: visual
execute:
warning: false
message: false
---
# The Rebalancing Challenge in Philadelphia
Philadelphia's Indego bike share system faces the same operational challenge as every bike share system: **rebalancing bikes to meet anticipated demand**.
Imagine you're an Indego operations manager at 6:00 AM on a Monday morning. You have: - 200 stations across Philadelphia - Limited trucks and staff for moving bikes - 2-3 hours before morning rush hour demand peaks - **The question:** Which stations will run out of bikes by 8:30 AM?
This lab will teach you to build predictive models that forecast bike share demand across **space** (different stations) and **time** (different hours) to help solve this operational problem.
## Learning Objectives
By the end of this assignment, you will be able to:
1. **Understand panel data structure** for space-time analysis
2. **Create temporal lag variables** to capture demand persistence
3. **Build multiple predictive models** with increasing complexity
4. **Validate models temporally** (train on past, test on future)
5. **Analyze prediction errors** in both space and time
6. **Engineer new features** based on error patterns
7. **Critically evaluate** when prediction errors matter most
```{r setup}
#| message: false
#| warning: false
#| results: 'hide'
library(tidyverse)
library(lubridate)
library(sf)
library(tigris)
library(tidycensus)
library(here)
library(riem) # For Philadelphia weather from ASOS stations
# Visualization
library(viridis)
library(gridExtra)
library(knitr)
library(kableExtra)
options(scipen = 999)
```
```{r themes}
plotTheme <- theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 11, face = "bold"),
panel.background = element_blank(),
panel.grid.major = element_line(colour = "#D0D0D0", size = 0.2),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
legend.position = "right"
)
mapTheme <- theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_line(colour = 'transparent'),
panel.grid.minor = element_blank(),
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"),
legend.key.width = unit(0.2, "cm")
)
palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")
```
```{r census key}
#| include: false
census_api_key("fe841b7ef0aa73d9579f0517bd1c8f26d33c789b")
```
## Part 1: Data Import & Preparation
### 1.1 \| Load Indego Trip Data
**Data Dictionary**
- **trip_id:** Locally unique integer that identifies the trip
- **duration:** Length of trip in minutes
- **start_time & end_time**
- **start_station; start_lat; start_lon**
- **end_station; end_lat; end_lon**
- **bike_id:** Locally unique integer that identifies the bike
- **plan_duration:** The number of days that the plan the passholder is using entitles them to ride; 0 is used for a single ride plan (Walk-up)
- **trip_route_category:** “Round Trip” for trips starting and ending at the same station or “One Way” for all other trips
- **passholder_type:** The name of the passholder’s plan
- **bike_type:** The kind of bike used on the trip, including standard pedal-powered bikes or electric assist bikes
```{r, include = FALSE}
getwd()
```
```{r load-indego}
indego <- read_csv("data/indego-trips-2025-q1.csv")
indego_q2 <- read_csv("data/indego-trips-2025-q2.csv")
indego_q3 <- read_csv("data/indego-trips-2025-q3.csv")
```
```{r, glimpse-indego, include = FALSE}
# Quick look at the data
head(indego)
head(indego_q2)
head(indego_q3)
```
### 1.2 \| Create Time Bins
- Helper function to create time bins – aggregate trips into hourly intervals for panel data structure
```{r time-bins}
indego <- indego %>%
mutate(
# Parse datetime
start_datetime = mdy_hm(start_time),
end_datetime = mdy_hm(end_time),
# Create hourly bins
interval60 = floor_date(start_datetime, unit = "hour"),
# Extract time features
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
# Create useful indicators
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
# Look at temporal features
head(indego %>% select(start_datetime, interval60, week, dotw, hour, weekend))
```
### 1.3 \| Exploratory Data Analysis
**Daily Trip Counts**
```{r trips_over_time}
daily_trips <- indego %>%
group_by(date) %>%
summarize(trips = n())
ggplot(daily_trips, aes(x = date, y = trips)) +
geom_line(color = "#3182bd", linewidth = 1) +
geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
labs(
title = "Indego Daily Ridership - Q1 2025",
subtitle = "Winter demand patterns in Philadelphia",
x = "Date",
y = "Daily Trips",
caption = "Source: Indego bike share"
) +
plotTheme
```
**Hourly Patterns**
```{r hourly_patterns}
# Average trips by hour and day type
hourly_patterns <- indego %>%
group_by(hour, weekend) %>%
summarize(avg_trips = n() / n_distinct(date)) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(hourly_patterns, aes(x = hour, y = avg_trips, color = day_type)) +
geom_line(linewidth = 1.2) +
scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Average Hourly Ridership Patterns",
subtitle = "Clear commute patterns on weekdays",
x = "Hour of Day",
y = "Average Trips per Hour",
color = "Day Type"
) +
plotTheme
```
**Top Stations**
```{r}
# Most popular origin stations
top_stations <- indego %>%
count(start_station, start_lat, start_lon, name = "trips") %>%
arrange(desc(trips)) %>%
head(20)
kable(top_stations,
caption = "Top 20 Indego Stations by Trip Origins",
format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
::: callout-note
## Observed Patterns for Q1 2025
- Lorem Ipsum
:::
------------------------------------------------------------------------
## Part 2: Philadelphia Census Data
### 2.1 \| Load Philadelphia Census Data
```{r load-census}
# Get Philadelphia census tracts
philly_census <- get_acs(
geography = "tract",
variables = c(
"B01003_001", # Total population
"B19013_001", # Median household income
"B08301_001", # Total commuters
"B08301_010", # Commute by transit
"B02001_002", # White alone
"B25077_001" # Median home value
),
state = "PA",
county = "Philadelphia",
year = 2022,
geometry = TRUE,
output = "wide"
) %>%
rename(
Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Total_Commuters = B08301_001E,
Transit_Commuters = B08301_010E,
White_Pop = B02001_002E,
Med_Home_Value = B25077_001E
) %>%
mutate(
Percent_Taking_Transit = (Transit_Commuters / Total_Commuters) * 100,
Percent_White = (White_Pop / Total_Pop) * 100
) %>%
st_transform(crs = 4326) # WGS84 for lat/lon matching
```
### 2.2 \| Map Philadelphia Context
```{r map-philly}
# Map median income
ggplot() +
geom_sf(data = philly_census, aes(fill = Med_Inc), color = NA) +
scale_fill_viridis(
option = "viridis",
name = "Median\nIncome",
labels = scales::dollar
) +
labs(
title = "Philadelphia Median Household Income by Census Tract",
subtitle = "Context for understanding bike share demand patterns"
) +
# Stations
geom_point(
data = indego,
aes(x = start_lon, y = start_lat),
color = "red", size = 0.25, alpha = 0.6
) +
mapTheme
```
### 2.3 \| Join Census Data to Stations
```{r visualize-fishnet}
# Create sf object for stations
stations_sf <- indego %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)
# Spatial join to get census tract for each station
stations_census <- st_join(stations_sf, philly_census, left = TRUE) %>%
st_drop_geometry()
```
**Left-Join Census Data to Q1 Indego Data**
```{r}
# Look at the result - investigate whether all of the stations joined to census data -- according to the map above there are stations in non-residential tracts.
stations_for_map <- indego %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
left_join(
stations_census %>% select(start_station, Med_Inc),
by = "start_station"
) %>%
mutate(has_census = !is.na(Med_Inc))
# Add back to trip data
indego_census <- indego %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
# Prepare data for visualization
stations_for_map <- indego %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
left_join(
stations_census %>% select(start_station, Med_Inc),
by = "start_station"
) %>%
mutate(has_census = !is.na(Med_Inc))
# Create the map showing problem stations
ggplot() +
geom_sf(data = philly_census, aes(fill = Med_Inc), color = "white", size = 0.1) +
scale_fill_viridis(
option = "viridis",
name = "Median\nIncome",
labels = scales::dollar,
na.value = "grey90"
) +
# Stations with census data (small grey dots)
geom_point(
data = stations_for_map %>% filter(has_census),
aes(x = start_lon, y = start_lat),
color = "grey30", size = 1, alpha = 0.6
) +
# Stations WITHOUT census data (red X marks the spot)
geom_point(
data = stations_for_map %>% filter(!has_census),
aes(x = start_lon, y = start_lat),
color = "red", size = 1, shape = 4, stroke = 1.5
) +
labs(
title = "Philadelphia Median Household Income by Census Tract",
subtitle = "Indego stations shown (RED = no census data match)",
caption = "Red X marks indicate stations that didn't join to census tracts"
) +
mapTheme
```
::: callout-note
## Observed Patterns
- Lorem Ipsum
:::
### 2.4 \| Dealing with Missing Data
We need to decide what to do with the non-residential bike share stations. For this example, we are going to remove them.
This is not necessarily the right way to do things always, but for the sake of simplicity, we are narrowing our scope to only stations in residential neighborhoods. We might opt to create a separate model for non-residential stations.
```{r}
# Identify which stations to keep
valid_stations <- stations_census %>%
filter(!is.na(Med_Inc)) %>%
pull(start_station)
# Filter trip data to valid stations only
indego_census <- indego %>%
filter(start_station %in% valid_stations) %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
```
------------------------------------------------------------------------
## Part 3: Get Weather Data
### 3.1 \| Weather Data for Q1 2025
```{r get-wea}
#| message: false
#| warning: false
# Get weather from Philadelphia International Airport (KPHL)
# This covers Q1 2025: January 1 - March 31
weather_data <- riem_measures(
station = "PHL", # Philadelphia International Airport
date_start = "2025-01-01",
date_end = "2025-03-31"
)
# Process weather data
weather_processed <- weather_data %>%
mutate(
interval60 = floor_date(valid, unit = "hour"),
Temperature = tmpf, # Temperature in Fahrenheit
Precipitation = ifelse(is.na(p01i), 0, p01i), # Hourly precip in inches
Wind_Speed = sknt # Wind speed in knots
) %>%
select(interval60, Temperature, Precipitation, Wind_Speed) %>%
# updated in-class code to only keep the first row for each hour
distinct(interval60, .keep_all = TRUE)
# Check for missing hours and interpolate if needed
weather_complete <- weather_processed %>%
complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
fill(Temperature, Precipitation, Wind_Speed, .direction = "down")
# Look at the weather
summary(weather_complete %>% select(Temperature, Precipitation, Wind_Speed))
```
### 3.2 \| Weather Data for Q2 2025
```{r}
weather_q2 <- riem_measures(
station = "PHL", # Philadelphia International Airport
date_start = "2025-04-01",
date_end = "2025-06-30"
)
# Process weather data
weather_q2 <- weather_q2 %>%
mutate(
interval60 = floor_date(valid, unit = "hour"),
Temperature = tmpf, # Temperature in Fahrenheit
Precipitation = ifelse(is.na(p01i), 0, p01i), # Hourly precip in inches
Wind_Speed = sknt # Wind speed in knots
) %>%
select(interval60, Temperature, Precipitation, Wind_Speed) %>%
# updated in-class code to only keep the first row for each hour
distinct(interval60, .keep_all = TRUE)
# Check for missing hours and interpolate if needed
weather_q2_complete <- weather_q2 %>%
complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
fill(Temperature, Precipitation, Wind_Speed, .direction = "down")
# Look at the weather
summary(weather_q2_complete %>% select(Temperature, Precipitation, Wind_Speed))
```
### 3.3 \| Weather Data for Q3 2025
```{r}
weather_q3 <- riem_measures(
station = "PHL", # Philadelphia International Airport
date_start = "2025-07-01",
date_end = "2025-09-30"
)
# Process weather data
weather_q3 <- weather_q3 %>%
mutate(
interval60 = floor_date(valid, unit = "hour"),
Temperature = tmpf, # Temperature in Fahrenheit
Precipitation = ifelse(is.na(p01i), 0, p01i), # Hourly precip in inches
Wind_Speed = sknt # Wind speed in knots
) %>%
select(interval60, Temperature, Precipitation, Wind_Speed) %>%
# updated in-class code to only keep the first row for each hour
distinct(interval60, .keep_all = TRUE)
# Check for missing hours and interpolate if needed
weather_q3_complete <- weather_q3 %>%
complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
fill(Temperature, Precipitation, Wind_Speed, .direction = "down")
# Look at the weather
summary(weather_q3_complete %>% select(Temperature, Precipitation, Wind_Speed))
```
### 3.4 \| Visualize Weather Patterns
```{r}
ggplot(weather_complete, aes(x = interval60, y = Temperature)) +
geom_line(color = "#3182bd", alpha = 0.7) +
geom_smooth(se = FALSE, color = "red") +
labs(
title = "Philadelphia Temperature - Q1 2025",
subtitle = "Winter to Early Spring Transition",
x = "Date",
y = "Temperature (°F)"
) +
plotTheme
```
```{r}
ggplot(weather_q2_complete, aes(x = interval60, y = Temperature)) +
geom_line(color = "#3182bd", alpha = 0.7) +
geom_smooth(se = FALSE, color = "red") +
labs(
title = "Philadelphia Temperature - Q2 2025",
subtitle = "Spring to Summer Transition",
x = "Date",
y = "Temperature (°F)"
) +
plotTheme
```
```{r}
ggplot(weather_q3_complete, aes(x = interval60, y = Temperature)) +
geom_line(color = "#3182bd", alpha = 0.7) +
geom_smooth(se = FALSE, color = "red") +
labs(
title = "Philadelphia Temperature - Q3 2025",
subtitle = "Summer to Fall Transition",
x = "Date",
y = "Temperature (°F)"
) +
plotTheme
```
------------------------------------------------------------------------
## Part 4: Create Space-Time Panel
### 4.1 \| Aggregate Trips to Station-Hour Level
**Q1 2025**
```{r}
# Count trips by station-hour
trips_panel <- indego_census %>%
group_by(interval60, start_station, start_lat, start_lon,
Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
summarize(Trip_Count = n()) %>%
ungroup()
# How many station-hour observations?
nrow(trips_panel)
# How many unique stations?
length(unique(trips_panel$start_station))
# How many unique hours?
length(unique(trips_panel$interval60))
```
### 4.2 \| Create Complete Panel Structure
Not every station has trips every hour. We need a **complete panel** where every station-hour combination exists (even if Trip_Count = 0).
```{r}
# Calculate expected panel size
n_stations <- length(unique(trips_panel$start_station))
n_hours <- length(unique(trips_panel$interval60))
expected_rows <- n_stations * n_hours
cat("Expected panel rows:", format(expected_rows, big.mark = ","), "\n")
cat("Current rows:", format(nrow(trips_panel), big.mark = ","), "\n")
cat("Missing rows:", format(expected_rows - nrow(trips_panel), big.mark = ","), "\n")
# Create complete panel
study_panel <- expand.grid(
interval60 = unique(trips_panel$interval60),
start_station = unique(trips_panel$start_station)
) %>%
# Join trip counts
left_join(trips_panel, by = c("interval60", "start_station")) %>%
# Replace NA trip counts with 0
mutate(Trip_Count = replace_na(Trip_Count, 0))
# Fill in station attributes (they're the same for all hours)
station_attributes <- trips_panel %>%
group_by(start_station) %>%
summarize(
start_lat = first(start_lat),
start_lon = first(start_lon),
Med_Inc = first(Med_Inc),
Percent_Taking_Transit = first(Percent_Taking_Transit),
Percent_White = first(Percent_White),
Total_Pop = first(Total_Pop)
)
study_panel <- study_panel %>%
left_join(station_attributes, by = "start_station")
# Verify we have complete panel
cat("Complete panel rows:", format(nrow(study_panel), big.mark = ","), "\n")
```
**Add Time Features**
```{r}
study_panel <- study_panel %>%
mutate(
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
```
**Join Weather Data**
```{r}
study_panel <- study_panel %>%
left_join(weather_complete, by = "interval60")
# Check for missing values
summary(study_panel %>% select(Trip_Count, Temperature, Precipitation))
```
```{r}
study_panel
```
------------------------------------------------------------------------
## Part 5: Create Temporal Lag Variables
Temporal Persistence: past demand predicts future demand
### 5.1 \| Create Temporal Lags
```{r}
# Sort by station and time
study_panel <- study_panel %>%
arrange(start_station, interval60)
# Create lag variables WITHIN each station
study_panel <- study_panel %>%
group_by(start_station) %>%
mutate(
lag1Hour = lag(Trip_Count, 1),
lag2Hours = lag(Trip_Count, 2),
lag3Hours = lag(Trip_Count, 3),
lag12Hours = lag(Trip_Count, 12),
lag1day = lag(Trip_Count, 24)
) %>%
ungroup()
# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete <- study_panel %>%
filter(!is.na(lag1day))
cat("Rows after removing NA lags:", format(nrow(study_panel_complete), big.mark = ","), "\n")
```
```{r}
# Sort by station and time
study_panel <- study_panel %>%
arrange(start_station, interval60)
# Create lag variables WITHIN each station
study_panel <- study_panel %>%
group_by(start_station) %>%
mutate(
lag1Hour = lag(Trip_Count, 1),
lag2Hours = lag(Trip_Count, 2),
lag3Hours = lag(Trip_Count, 3),
lag12Hours = lag(Trip_Count, 12),
lag1day = lag(Trip_Count, 24)
) %>%
ungroup()
# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete <- study_panel %>%
filter(!is.na(lag1day))
cat("Rows after removing NA lags:", format(nrow(study_panel_complete), big.mark = ","), "\n")
```
### 5.2 \| Visualize Lag Correlations
```{r}
# Sample one station to visualize
example_station <- study_panel_complete %>%
filter(start_station == first(start_station)) %>%
head(168) # One week
# Plot actual vs lagged demand
ggplot(example_station, aes(x = interval60)) +
geom_line(aes(y = Trip_Count, color = "Current"), linewidth = 1) +
geom_line(aes(y = lag1Hour, color = "1 Hour Ago"), linewidth = 1, alpha = 0.7) +
geom_line(aes(y = lag1day, color = "24 Hours Ago"), linewidth = 1, alpha = 0.7) +
scale_color_manual(values = c(
"Current" = "#08519c",
"1 Hour Ago" = "#3182bd",
"24 Hours Ago" = "#6baed6"
)) +
labs(
title = "Temporal Lag Patterns at One Station",
subtitle = "Past demand predicts future demand",
x = "Date-Time",
y = "Trip Count",
color = "Time Period"
) +
plotTheme
```
------------------------------------------------------------------------
## Part 6: Temporal Train/Test Split
Approach: Train on Past Data and Test on Future Data
```{r temporal-split}
# Split by week
# Q1 has weeks 1-13 (Jan-Mar)
# Train on weeks 1-9 (Jan 1 - early March)
# Test on weeks 10-13 (rest of March)
# Which stations have trips in BOTH early and late periods?
early_stations <- study_panel_complete %>%
filter(week < 10) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
late_stations <- study_panel_complete %>%
filter(week >= 10) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
# Keep only stations that appear in BOTH periods
common_stations <- intersect(early_stations, late_stations)
# Filter panel to only common stations
study_panel_complete <- study_panel_complete %>%
filter(start_station %in% common_stations)
# NOW create train/test split
train <- study_panel_complete %>%
filter(week < 10)
test <- study_panel_complete %>%
filter(week >= 10)
cat("Training observations:", format(nrow(train), big.mark = ","), "\n")
cat("Testing observations:", format(nrow(test), big.mark = ","), "\n")
cat("Training date range:", min(train$date), "to", max(train$date), "\n")
cat("Testing date range:", min(test$date), "to", max(test$date), "\n")
```
------------------------------------------------------------------------
## Part 7: Build 5 Predictive Models
We'll build 5 models with increasing complexity to see what improves predictions.
### Model 1: Baseline (Time + Weather)
```{r model-1}
# Create day of week factor with treatment (dummy) coding
train <- train %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(train$dotw_simple) <- contr.treatment(7)
model1 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
data = train
)
summary(model1)
```
The model uses Monday as the baseline. Each coefficient represents the difference in expected trips per station-hour compared to Monday. For example, dow_simple2 = Tuesday.
**Weekday Pattern (Tue-Fri):**
- All weekdays have positive coefficients (0.029 to 0.052)
- Tuesday has the highest weekday effect (+0.052)
- Weekdays likely benefit from concentrated commuting patterns
**Weekend Pattern (Sat-Sun):**
- Both weekend days have negative coefficients (-0.061 and -0.065)
- This means fewer trips per station-hour than Monday
### Model 2: Add Temporal Lags
```{r model-2}
model2 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day,
data = train
)
summary(model2)
```
### Model 3: Add Demographics
```{r model-3}
model3 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y,
data = train
)
cat("Model 3 R-squared:", summary(model3)$r.squared, "\n")
cat("Model 3 Adj R-squared:", summary(model3)$adj.r.squared, "\n")
```
### Model 4: Add Station Fixed Effects
Station fixed effects capture Baseline differences in demand across stations – some are just busier than others.
```{r model-4}
model4 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
as.factor(start_station),
data = train
)
cat("Model 4 R-squared:", summary(model4)$r.squared, "\n")
cat("Model 4 Adj R-squared:", summary(model4)$adj.r.squared, "\n")
```
### Model 5: Add Rush Hour Interaction
```{r model-5}
model5 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
Med_Inc.x + Percent_Taking_Transit.y + Percent_White.y +
as.factor(start_station) +
rush_hour * weekend, # Rush hour effects different on weekends
data = train
)
cat("Model 5 R-squared:", summary(model5)$r.squared, "\n")
cat("Model 5 Adj R-squared:", summary(model5)$adj.r.squared, "\n")
```
------------------------------------------------------------------------
## Part 8: Model Evaluation
### 8.1 \| Predictions on Test Setand MAE
```{r}
# Create day of week factor with treatment (dummy) coding
test <- test %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(test$dotw_simple) <- contr.treatment(7)
test <- test %>%
mutate(
pred1 = predict(model1, newdata = test),
pred2 = predict(model2, newdata = test),
pred3 = predict(model3, newdata = test),
pred4 = predict(model4, newdata = test),
pred5 = predict(model5, newdata = test)
)
# Calculate MAE for each model
mae_results <- data.frame(
Model = c(
"1. Time + Weather",
"2. + Temporal Lags",
"3. + Demographics",
"4. + Station FE",
"5. + Rush Hour Interaction"
),
MAE = c(
mean(abs(test$Trip_Count - test$pred1), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred2), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred3), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred4), na.rm = TRUE),
mean(abs(test$Trip_Count - test$pred5), na.rm = TRUE)
)
)
kable(mae_results,
digits = 2,
caption = "Mean Absolute Error by Model (Test Set)",
col.names = c("Model", "MAE (trips)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
### 8.2 \| Visualize Model Comparison
```{r}
ggplot(mae_results, aes(x = reorder(Model, -MAE), y = MAE)) +
geom_col(fill = "#3182bd", alpha = 0.8) +
geom_text(aes(label = round(MAE, 2)), vjust = -0.5) +
labs(
title = "Model Performance Comparison",
subtitle = "Lower MAE = Better Predictions",
x = "Model",
y = "Mean Absolute Error (trips)"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
------------------------------------------------------------------------
## Part 9: Space-Time Error Analysis using Best Model
We conduct space-time error analysis using Model 2 which has the lowest MAE for Q1 2025 data.
### 9.1 \| Observed vs Predicted
```{r obs-vs-pred}
test <- test %>%
mutate(
error = Trip_Count - pred2,
abs_error = abs(error),
time_of_day = case_when(
hour < 7 ~ "Overnight",
hour >= 7 & hour < 10 ~ "AM Rush",
hour >= 10 & hour < 15 ~ "Mid-Day",
hour >= 15 & hour <= 18 ~ "PM Rush",
hour > 18 ~ "Evening"
)
)
# Scatter plot by time and day type
ggplot(test, aes(x = Trip_Count, y = pred2)) +
geom_point(alpha = 0.2, color = "#3182bd") +
geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
facet_grid(weekend ~ time_of_day) +
labs(
title = "Observed vs. Predicted Bike Trips",
subtitle = "Model 2 performance by time period",
x = "Observed Trips",
y = "Predicted Trips",
caption = "Red line = perfect predictions; Green line = actual model fit"
) +
plotTheme
```
### 9.2 \| Spatial Error Patterns
- Create error maps
- Identify neighborhoods with high errors
- Hypothesize why (missing features? different demand patterns?)
**Calculate MAE by Station**
```{r spatial-errors}
# Calculate MAE by station
station_errors <- test %>%
group_by(start_station, start_lat.x, start_lon.y) %>%
summarize(
MAE = mean(abs_error, na.rm = TRUE),
avg_demand = mean(Trip_Count, na.rm = TRUE),
.groups = "drop"
) %>%
filter(!is.na(start_lat.x), !is.na(start_lon.y))
## Create Two Maps Side-by-Side with Proper Legends (sorry these maps are ugly)
# Calculate station errors
station_errors <- test %>%
filter(!is.na(pred2)) %>%
group_by(start_station, start_lat.x, start_lon.y) %>%
summarize(
MAE = mean(abs(Trip_Count - pred2), na.rm = TRUE),
avg_demand = mean(Trip_Count, na.rm = TRUE),
.groups = "drop"
) %>%
filter(!is.na(start_lat.x), !is.na(start_lon.y))
```
**Visualize**
```{r}
# Map 1: Prediction Errors
p1 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors,
aes(x = start_lon.y, y = start_lat.x, color = MAE),
size = 3.5,
alpha = 0.7
) +
scale_color_viridis(
option = "plasma",
name = "MAE\n(trips)",
direction = -1,
breaks = c(0.5, 1.0, 1.5), # Fewer, cleaner breaks
labels = c("0.5", "1.0", "1.5")
) +
labs(title = "Prediction Errors",
subtitle = "Higher in Center City") +
mapTheme +
theme(
legend.position = "right",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
) +
guides(color = guide_colorbar(
barwidth = 1.5,
barheight = 12,
title.position = "top",
title.hjust = 0.5
))
# Map 2: Average Demand
p2 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors,
aes(x = start_lon.y, y = start_lat.x, color = avg_demand),
size = 3.5,
alpha = 0.7
) +
scale_color_viridis(
option = "viridis",
name = "Avg\nDemand",
direction = -1,
breaks = c(0.5, 1.0, 1.5, 2.0, 2.5), # Clear breaks
labels = c("0.5", "1.0", "1.5", "2.0", "2.5")
) +
labs(title = "Average Demand",
subtitle = "Trips per station-hour") +
mapTheme +
theme(
legend.position = "right",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
) +
guides(color = guide_colorbar(
barwidth = 1.5,
barheight = 12,
title.position = "top",
title.hjust = 0.5
))
# Combine
grid.arrange(
p1, p2,
ncol = 2
)
p1
```
### 9.3 \| Temporal Error Patterns
- When are errors highest?
- Do certain hours/days have systematic under/over-prediction?
- Are there seasonal patterns?
```{r temporal-errors}
# MAE by time of day and day type
temporal_errors <- test %>%
group_by(time_of_day, weekend) %>%
summarize(
MAE = mean(abs_error, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(temporal_errors, aes(x = time_of_day, y = MAE, fill = day_type)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Prediction Errors by Time Period",
subtitle = "When is the model struggling most?",
x = "Time of Day",
y = "Mean Absolute Error (trips)",
fill = "Day Type"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
### 9.4 \| Demographic Patterns
- Relate errors to census characteristics
- Are certain communities systematically harder to predict?
- What are the equity implications?
```{r demographic-errors}
# Join demographic data to station errors
station_errors_demo <- station_errors %>%
left_join(
station_attributes %>% select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
by = "start_station"
) %>%
filter(!is.na(Med_Inc))
# Create plots
p1 <- ggplot(station_errors_demo, aes(x = Med_Inc, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
scale_x_continuous(labels = scales::dollar) +
labs(title = "Errors vs. Median Income", x = "Median Income", y = "MAE") +
plotTheme
p2 <- ggplot(station_errors_demo, aes(x = Percent_Taking_Transit, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Transit Usage", x = "% Taking Transit", y = "MAE") +
plotTheme
p3 <- ggplot(station_errors_demo, aes(x = Percent_White, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Race", x = "% White", y = "MAE") +
plotTheme
grid.arrange(p1, p2, p3, ncol = 2)
```
------------------------------------------------------------------------
## Part 10: Replicate with Q2 2025
1. **Download data** from: <https://www.rideindego.com/about/data/>
2. **Adapt this code** to work with your quarter:
- Update date ranges for weather data
- Check for any data structure changes
- Create the same 5 models
- Calculate MAE for each model
3. **Compare results** to Q1 2025:
- How do MAE values compare? Why might they differ?
- Are temporal patterns different (e.g., summer vs. winter)?
- Which features are most important in your quarter?
------------------------------------------------------------------------
## Part 11: Error Analysis for Q2 2025 vs Q1 2025
```{r}
```
------------------------------------------------------------------------
## Part 12: Feature Engineering & model improvement
Based on your error analysis, add 2-3 NEW features to improve the model:
**Potential features to consider:**
*Temporal features:*
- Holiday indicators (Memorial Day, July 4th, Labor Day)
- School calendar (Penn, Drexel, Temple in session?)
- Special events (concerts, sports games, conventions)
- Day of month (payday effects?)
*Weather features:*
- Feels-like temperature (wind chill/heat index)
- "Perfect biking weather" indicator (60-75°F, no rain)
- Precipitation forecast (not just current)
- Weekend + nice weather interaction
*Spatial features:*
- Distance to Center City
- Distance to nearest university
- Distance to nearest park
- Points of interest nearby (restaurants, offices, bars)
- Station capacity
- Bike lane connectivity
*Trip history features:*
- Rolling 7-day average demand
- Same hour last week
- Station "type" clustering (residential, commercial, tourist)
**Implementation:**
- Add your features to the best model
- Compare MAE before and after
- Explain *why* you chose these features
- Did they improve predictions? Where?
**Try a poisson model for count data**
- Does this improve model fit?
------------------------------------------------------------------------
## Conclusion: Critical Reflection
Write 1-2 paragraphs addressing:
1. **Operational implications:**
- Is your final MAE "good enough" for Indego to use?
- When do prediction errors cause problems for rebalancing?
- Would you recommend deploying this system? Under what conditions?
2. **Equity considerations:**
- Do prediction errors disproportionately affect certain neighborhoods?
- Could this system worsen existing disparities in bike access?
- What safeguards would you recommend?
3. **Model limitations:**
- What patterns is your model missing?
- What assumptions might not hold in real deployment?
- How would you improve this with more time/data?
------------------------------------------------------------------------
### Brief report summarizing (with supporting data & visualization):
- Your quarter and why you chose it
- Model comparison results
- Error analysis insights
- New features you added and why
- Critical reflection on deployment